Skip to content

Conversation

@ammar-agent
Copy link
Collaborator

@ammar-agent ammar-agent commented Oct 13, 2025

Problem

Integration test should include mode-specific instructions in system message is flaky (#224). When testing mode switches mid-conversation, models sometimes respond with the wrong mode marker ([PLAN_MODE_ACTIVE] instead of [EXEC_MODE_ACTIVE]).

Root cause: Models see conflicting signals:

  • System message says "You're in EXEC mode"
  • Conversation history shows they just said they were in PLAN mode
  • User prompt is vague ("Please respond.") with no transition signal

Models sometimes prioritize conversation consistency over system instructions.

Solution

Inject mode transition as a temporal user message in the conversation flow.

When mode changes, insert a synthetic user message before the final user message:

[Mode switched from plan to exec. Follow exec mode instructions.]

Benefits:

  • ✅ Temporal - transition happens in natural conversation flow
  • ✅ Models handle in-message context better than system changes
  • ✅ Simple - no metadata persistence complexity
  • ✅ Works for both tests and production usage

Implementation

  • Added injectModeTransition() to modelMessageTransform.ts
  • Operates on CmuxMessage[] where metadata.mode is available
  • Called after addInterruptedSentinel, before converting to ModelMessage
  • Mode persisted in assistant message metadata for next request
  • Added comprehensive unit tests (5 test cases)

Testing

  • ✅ All unit tests pass
  • ✅ CI will verify improved integration test reliability
  • ✅ Works for real-world mode switches mid-conversation

Closes #224

- Add 'mode' field to CmuxMetadata to track mode per message
- Detect mode switches by comparing with last assistant message
- Inject explicit transition note when mode changes mid-conversation
- Helps models understand they should follow new mode instructions

Addresses #224 - flaky mode-specific instructions test
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Changed approach: inject mode switch as synthetic user message instead
of system instruction enhancement.

Benefits:
- More temporal - transitions happen in conversation flow
- Models handle in-message context better than system changes
- Simpler - no need to persist mode across stream lifecycle
- Avoids metadata persistence issue caught by Codex

The synthetic message is inserted before the last user message when
mode changes, providing natural context for the model.

Implementation:
- Added injectModeTransition() to modelMessageTransform.ts
- Operates on CmuxMessage[] where metadata is available
- Inserts synthetic user message: '[Mode switched from X to Y]'
- Called after addInterruptedSentinel, before conversion to ModelMessage

Co-authored-by: Codex (review feedback)
- Added 5 unit tests for injectModeTransition()
- Fixed edge case: don't inject when no user messages exist
- Pass mode to StreamManager so it persists in final history
- Updated PR description to be clearer and more concise

Co-authored-by: Codex (persistence fix)
Copy link
Member

@ammario ammario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've verified conversationally much better agent knowledge of modes with this fix

@ammario ammario merged commit fa61ff6 into main Oct 13, 2025
7 checks passed
@ammario ammario deleted the flakes branch October 13, 2025 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🤖 Flaky integration test: OpenAI mode-specific instructions not respected

2 participants